of a particular gene (and perhaps their co-occurrence with mentions of a particular

disease). The unorthodox usage of language in many contemporary papers is difficult

enough for a human reader to interpret, let alone for artificial intelligence. 16 Hence,

whether the results of such mining are going to be useful is a moot point. There

appear to be no attempts currently to weight the value of the “ore” according to some

assessment of the reliability of any facts reported and assertions made. But these

difficulties must be weighed alongside the general growth in overall understanding

that is hopefully taking place. The edifice of reliable knowledge gradually being

erected from the bricks supplied by individual laboratories allows inferences to be

made at an increasingly high level, and these might well render largely superfluous

endless automated reworking of the mass of facts and purported facts reported in the

primary research literature.

One area in which it seems likely that something interesting could emerge is the

search for clumps or clusters of objects (which might be words, phrases, or even whole

documents) for which no descriptive term yet exists. Such a search

might be based on a rather abstract measure of relevance (which must, of course, be

judiciously chosen), along the lines suggested by Good (1962), and adumbrated in

Sect. 13.2. This would be very much in the spirit of the clusters emerging when the

frequencies of n-grams in DNA are examined (cf. Sect. 17.6).
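
As a concrete illustration of the raw material for such a search, the n-gram frequencies of a sequence might be tallied along the following lines. This is only a minimal sketch: the function name `ngram_frequencies` and the toy sequence are hypothetical choices made for illustration, and the code implements neither the relevance measure of Good (1962) nor any clustering step; it merely counts overlapping n-grams and normalizes the counts.

```python
from collections import Counter
from typing import Dict


def ngram_frequencies(sequence: str, n: int) -> Dict[str, float]:
    """Return the relative frequency of every overlapping n-gram in `sequence`.

    Hypothetical helper for illustration only; not taken from the text.
    """
    sequence = sequence.upper()
    if n <= 0 or n > len(sequence):
        raise ValueError("n must lie between 1 and the sequence length")
    # Slide a window of width n along the sequence, one position at a time.
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    total = sum(counts.values())
    # Normalize the raw counts so that the frequencies sum to 1.
    return {gram: count / total for gram, count in counts.items()}


# Toy sequence (an assumption made purely for demonstration).
toy_sequence = "ATGCGATATATGCG"
freqs = ngram_frequencies(toy_sequence, 2)

# List the dinucleotides from most to least frequent.
for gram, freq in sorted(freqs.items(), key=lambda item: -item[1]):
    print(gram, round(freq, 3))
```

The same tally applies unchanged to words or phrases if the input is a list of tokens rather than a string of nucleotides, provided the slices are converted to tuples before counting; any search for clusters would then operate on the resulting frequency vectors.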

If, indeed, knowledge representation moves toward probability distributions

(Sect. 31.3), it would be of great value if text mining could deliver quantitative

appraisals of the uncertainties of reported experimental results, which would have to

include an assessment of the entire framework of the experiment (cf. Sect. 6.1.1)—

that is, the structural information, as well as of the metrical information gained from

the individual measurements (cf. Table 6.1). We seem to be rather far from achieving

this automatically at present, but the goal merits the strongest efforts, for without

such a capability, we risk being condemned to ever more fragmented knowledge,

which, as a body, is increasingly shot through with internal contradictions. 17

31.6 The Automation of Research

Much of the laboratory work required for high-throughput genomics can be auto-

mated and carried out by laboratory robots according to a strictly executed set of

instructions. In many ways this is better than carrying out the manipulations man-

ually: the robot is likely to be able to execute its instructions more uniformly and

reliably than a human experimenter. It also has the advantage that a comprehensive

16 This is mainly a consequence of the language overwhelmingly used to write papers being English,

which is not the native tongue of most scientists nowadays, and the reluctance of open-access

publishers to spend money on editing.

17 Tensor factorization analysis is an encouraging movement towards more precision in text mining

(see Roy et al. 2017 for an application to transcription factors; note that this work confines itself to

analysing the abstracts of papers rather than the full texts—a corollary of which is that all publishers

should strive to ensure that the greatest possible care is taken in ensuring the integrity of abstracts).